Homework 2 sample solutions

DKU Stats 101 Fall 2024 Session 1

Author

Anonymous

Published

September 16, 2024

Photo courtesy of Dialectic Engineering

1. Summarize the data (10 Points)

To enter a new market, it’s essential to understand the landscape. By summarizing the existing data on Western EVs, you can provide your employer with an overview of the key performance metrics of the competition.

Task 1a: Summarize the quantitative variables

Summarize the quantitative variables in the dataset using appropriate summary statistics. Create a well-organized table to present these summary statistics.

Summary of quantitative data
| Variable | Mean | Std. Dev. |
|---|---|---|
| Acceleration (seconds to 100 kph) | 7.4 | 3.0 |
| Top speed (kph) | 179.2 | 43.6 |
| Efficiency (wh per km) | 189.2 | 29.6 |
| Range (km) | 338.8 | 126.0 |
| Fast charge (kph) | 456.7 | 201.3 |
| Seats | 4.9 | 0.8 |
| Price (euros) | 55811.6 | 34134.7 |

There are many ways to make this table; this is just one example. Any table that reports a measure of center and a measure of spread for each quantitative variable is sufficient.
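As a minimal sketch of one way to build such a table in base R: the data frame below is a made-up stand-in for the homework data (only the column names come from the assignment), and the choice of mean and standard deviation is just one acceptable pairing of center and spread.

```r
# Toy stand-in for the homework data; these values are made up.
evs <- data.frame(
  AccelSec        = c(4.6, 10.0, 7.3, 9.5, 2.8),
  TopSpeed_KmH    = c(233, 160, 180, 145, 250),
  Efficiency_WhKm = c(161, 167, 181, 168, 223),
  Range_Km        = c(450, 270, 400, 170, 375)
)

# One row per variable: a measure of center (mean) and of spread (sd).
summary_table <- data.frame(
  Variable  = names(evs),
  Mean      = round(sapply(evs, mean, na.rm = TRUE), 1),
  SD        = round(sapply(evs, sd, na.rm = TRUE), 1),
  row.names = NULL
)
print(summary_table)
```

Median and interquartile range would work equally well here, and may be preferable for skewed variables such as price.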

Task 1b: Visualize the distribution of body styles

Use an appropriate plot to visualize the distribution of BodyStyle in the dataset. Describe what you observe about the most common body styles.

Body style distribution

SUVs, followed by hatchbacks, are by far the most common body styles. To me, this is a bit surprising, as hatchbacks are a somewhat unusual body type for gasoline-powered cars. Sedans, which are the most common car type in most countries, are a distant third. This suggests, to me, that EVs may differ in nature from gasoline-powered cars.

Task 1c: Fast Charging capability

What percentage of the vehicles in the dataset support RapidCharge? For those that do, what is the average FastCharge_KmH?

Fast charging statistics
| % of cars with rapid charging | If has rapid charging, mean fast charge rate (kph) |
|---|---|
| 95.15 | 456.73 |

There are many ways to make this table; this is just one example.
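A minimal sketch of the two calculations in base R follows. The rows are made up, and the "Yes"/"No" coding of RapidCharge is an assumption; the actual labels in the dataset may differ.

```r
# Made-up rows; the "Yes"/"No" coding of RapidCharge is an assumption.
evs <- data.frame(
  RapidCharge    = c("Yes", "Yes", "No", "Yes"),
  FastCharge_KmH = c(940, 250, NA, 560)
)

has_rapid <- evs$RapidCharge == "Yes"

# Share of cars with rapid charging, as a percentage.
pct_rapid <- 100 * mean(has_rapid)

# Mean fast-charge rate among only the cars that support rapid charging.
mean_fast <- mean(evs$FastCharge_KmH[has_rapid], na.rm = TRUE)

round(c(pct_rapid = pct_rapid, mean_fast = mean_fast), 2)
```

Restricting the mean to cars with rapid charging matters because cars without it have no meaningful fast-charge rate.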

Task 1d: Identify the car with the highest top speed

Identify which car in the dataset has the highest top speed (TopSpeed_KmH). Report the car’s name, brand, and top speed.

Car with the highest top speed

| Model | Brand | Top speed (kph) |
|---|---|---|
| Roadster | Tesla | 410 |

There are many ways to make this table; this is just one example.

Task 1e: Explore the distribution of top speed by power train

Investigate how top speed varies depending on power train type (PowerTrain) by creating an appropriate plot. Discuss any patterns or trends you observe, and provide an explanation for these trends.

Top speed variation by power train

It seems that all-wheel drive (AWD) electric cars have the largest spread in top speed and the highest average top speed. There is one outlier (the Tesla Roadster previously identified). Front-wheel drive (FWD) cars have the smallest spread and also the lowest average top speed, while rear-wheel drive (RWD) cars fall between the two. One possibility is that FWD cars are some kind of basic or budget car type, RWD cars are a little more powerful, and AWD cars are the most powerful and fastest type. This makes some sense, as it is probably more expensive to power four wheels instead of only two.

2. Relationship between two variables (15 Points)

Understanding the relationship between key performance metrics like speed and acceleration can reveal important insights about what makes certain vehicles stand out. Your employer will be interested in knowing how these factors interact and whether there are trade-offs to be aware of.

Task 2a: Calculate the correlation between top speed and acceleration

Calculate the correlation coefficient between top speed (TopSpeed_KmH) and acceleration (AccelSec). Explain what the correlation coefficient tells you about the relationship between these two variables.

Correlation between top speed and acceleration
| x | y | r |
|---|---|---|
| Top speed (kph) | Acceleration (seconds to 100 kph) | −0.79 |

There are many ways to make this table; this is just one example. The correlation coefficient of −0.79 (assuming the conditions for correlation are met) indicates a fairly strong negative relationship between top speed and acceleration time: cars that take longer to reach 100 kph tend to have lower top speeds. This result makes sense, as cars with powerful engines are usually capable of both fast acceleration and high top speeds.
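For reference, a minimal sketch of the calculation with base R's cor(), using made-up rows with the assignment's column names. It also illustrates a general fact that connects this task to Task 2d: in a regression with a single predictor, R-squared equals the square of the correlation, which is why −0.79 squared (about 0.62) matches the R2 reported for the top speed model.

```r
# Made-up rows; only the column names come from the assignment.
evs <- data.frame(
  AccelSec     = c(2.8, 4.6, 7.3, 9.5, 10.0, 12.3),
  TopSpeed_KmH = c(250, 233, 180, 145, 160, 130)
)

# Pearson correlation, ignoring any rows with missing values.
r <- cor(evs$TopSpeed_KmH, evs$AccelSec, use = "complete.obs")

# With one predictor, the regression R-squared is the squared correlation.
r2 <- summary(lm(TopSpeed_KmH ~ AccelSec, data = evs))$r.squared

c(r = r, r_squared = r^2, model_r2 = r2)
```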

Task 2b: Create a scatterplot of top speed vs. acceleration

Create a scatterplot to visualize the relationship between top speed and acceleration. Identify any potential outliers and discuss their impact on the relationship.

Scatterplot of top speed vs. acceleration

Answers need to identify the specific models that are outliers, either by directly labeling the plot or by some other means. In this case, the outliers may have high leverage (since they are far from the mean of x), but it is unclear how much influence they have. Overall, the relationship (excluding the outliers) is negative and relatively straight, as we might expect from the correlation calculation.

Task 2c: Add a LOESS smoother to the scatterplot

Improve the scatterplot by adding a LOESS smoother. Add a confidence interval to the LOESS smoother and explain why the confidence interval is larger at slower levels of acceleration.

Without confidence interval

With confidence interval

Smoothed relationship between top speed and acceleration

Any reasonable guess as to the confidence interval can be offered here.

Task 2d: Build a bivariate regression model

Create a bivariate regression model where top speed is predicted by acceleration. Interpret the model’s coefficients and discuss what they tell you about the relationship between these two variables.

Modeling top speed with acceleration
| | Top speed model |
|---|---|
| (Intercept) | 263.162 (7.088) |
| Acceleration (seconds to 100 kph) | −11.353 (0.888) |
| N | 103 |
| R2 | 0.62 |
| Residual standard deviation | 27 |

There are many ways to make this table; this is just one example. First, the intercept indicates that a car taking zero seconds to reach 100 kph would have a predicted top speed of 263 kph, which is nonsense. The coefficient on acceleration indicates that for each additional second it takes to reach 100 kph from 0, the predicted top speed of the car decreases by about 11 kph. To me, this is a substantial effect: going from 5 seconds to 10 seconds to reach 100 kph decreases the predicted top speed by over 55 kph. Generally speaking, then, the faster a car accelerates (the less time it takes to reach 100 kph), the higher its top speed, according to the model. This result makes sense, as sports cars are generally built both to accelerate quickly and to reach high top speeds.
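As a worked check of the slope interpretation, the fitted line from the table above can be evaluated at 0-100 kph times of 5 and 10 seconds. The figures use the rounded coefficients as printed, so they are approximate.

$$
\begin{aligned}
\widehat{\text{TopSpeed}} &= 263.162 - 11.353 \times \text{AccelSec} \\
\text{AccelSec} = 5: \quad \widehat{\text{TopSpeed}} &= 263.162 - 56.765 = 206.397 \text{ kph} \\
\text{AccelSec} = 10: \quad \widehat{\text{TopSpeed}} &= 263.162 - 113.530 = 149.632 \text{ kph}
\end{aligned}
$$

The car that needs 5 seconds is predicted to be about 56.8 kph faster at the top end than the car that needs 10 seconds.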

3. Relationship between multiple variables (15 Points)

Range anxiety is a common concern among potential EV buyers. Your employer will want to know how factors like efficiency and power train type affect the range of a vehicle, as well as how price plays into this equation.

Task 3a: Model range as a function of efficiency

Create a regression model that predicts the Range_Km of an EV based on its efficiency (Efficiency_WhKm). Report the model’s coefficients and interpret them.

Modeling range
| | Range model |
|---|---|
| (Intercept) | 86.376 (77.106) |
| Efficiency (wh per km) | 1.334 (0.403) |
| N | 103 |
| R2 | 0.10 |
| Residual standard deviation | 120 |

There are many ways to make this table; this is just one example. First, the intercept indicates that when efficiency is zero, the predicted range is 86.4 km, which is nonsense. The coefficient on efficiency indicates that for each additional watt hour per kilometer of energy consumption, the predicted range of the car increases by 1.3 kilometers. To me, this is a bit of a strange result, because lower numbers are normally better for efficiency. We would expect that the more efficient a car, the longer its range, so the two should be negatively related. The result here might be a consequence of the fact that larger cars can hold larger batteries, permitting greater range.

Task 3b: Add PowerTrain to the model

Extend the previous model by adding PowerTrain as an additional predictor. Describe how the inclusion of PowerTrain changes the model’s coefficients and interpretation.

Modeling range
Range model

| | (1) | (2) |
|---|---|---|
| (Intercept) | 86.376 (77.106) | 390.864 (84.547) |
| Efficiency (wh per km) | 1.334 (0.403) | 0.172 (0.401) |
| Powertrain: FWD | | −152.850 (26.783) |
| Powertrain: RWD | | −122.532 (28.525) |
| N | 103 | 103 |
| R2 | 0.10 | 0.33 |
| Residual standard deviation | 120 | 104 |

Reference category for Powertrain: AWD

There are many ways to make this table; this is just one example. We can see that adding the powertrain variable substantially decreases the magnitude of the efficiency coefficient. In this case, the intercept is interpreted as the predicted range when efficiency is zero and the car has an AWD powertrain. Having either of the other powertrain types results in a very large reduction in predicted range, roughly 120 to 150 km. In Task 1e, we saw that AWD cars had a much higher average top speed. So these cars may be more expensive, larger, or more powerful, which would account for their greater range.
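To make the reference-category idea concrete, here is a minimal sketch with made-up rows (only the column names come from the assignment). R would pick AWD as the baseline anyway because it sorts first alphabetically; relevel() simply makes that choice explicit.

```r
# Made-up rows; only the column names come from the assignment.
evs <- data.frame(
  Range_Km        = c(450, 270, 400, 170, 375, 330, 610, 95, 440, 250),
  Efficiency_WhKm = c(161, 167, 181, 168, 223, 180, 180, 176, 198, 164),
  PowerTrain      = c("AWD", "RWD", "AWD", "FWD", "AWD",
                      "RWD", "AWD", "RWD", "FWD", "FWD")
)

# Make AWD the explicit reference (baseline) category.
evs$PowerTrain <- relevel(factor(evs$PowerTrain), ref = "AWD")

model <- lm(Range_Km ~ Efficiency_WhKm + PowerTrain, data = evs)

# There is no AWD coefficient: the FWD and RWD coefficients are
# differences in predicted range relative to an AWD car with the
# same efficiency.
coef(model)
```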

Task 3c: Visualize range vs. efficiency by PowerTrain

Create a scatterplot to visualize the relationship between range and efficiency, with the points colored according to PowerTrain. Discuss whether this plot changes your interpretation of the model.

Range as a function of efficiency, colored by powertrain

The plot must have range on the y axis and efficiency on the x axis. Overall, the plot helps confirm some of the conjecture offered in the previous answer. RWD and FWD cars do not seem to differ much, mostly clustering in the same area of the plot. AWD cars seem to follow a different pattern, located largely in the upper right-hand quadrant of the plot. There are several AWD outliers. Overall, it seems AWD cars are somehow importantly different in overall design or construction compared to FWD or RWD cars.

Task 3d: Add price to the model

Extend the model by adding PriceEuro as another independent variable. Compare this model to the earlier models, and discuss any changes in the coefficients and interpretation.

Modeling range
Range model

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 86.376 (77.106) | 390.864 (84.547) | 268.596 (78.158) |
| Efficiency (wh per km) | 1.334 (0.403) | 0.172 (0.401) | −0.027 (0.357) |
| Powertrain: FWD | | −152.850 (26.783) | −64.632 (28.848) |
| Powertrain: RWD | | −122.532 (28.525) | −42.439 (29.322) |
| Price (euros) | | | 0.002 (0.000) |
| N | 103 | 103 | 103 |
| R2 | 0.10 | 0.33 | 0.48 |
| Residual standard deviation | 120 | 104 | 92 |

Reference category for Powertrain: AWD

By adding price to the model, we can see that it substantially decreases the size of the coefficients on powertrain and efficiency. The very small coefficient on efficiency is now negative, but either way the variable has only a very small impact on range. For price, every 1000 Euro increase in price increases predicted range by about 2 kilometers. Overall, this effect seems moderate. A 20000 Euro difference in price changes predicted range by about 40 kilometers, which is not insignificant, but there appear to be other factors that must also matter.

4. Model fit (10 Points)

A good model fit is crucial for making reliable predictions and understanding the underlying relationships in the data. Your employer will want to know which models are most reliable for identifying key performance metrics.

Task 4a: Compare model fit using R-squared

Compare the fit of the models from Question 3 using R-squared values. Which model fits the data best, and why?

Modeling range
Range model

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 86.376 (77.106) | 390.864 (84.547) | 268.596 (78.158) |
| Efficiency (wh per km) | 1.334 (0.403) | 0.172 (0.401) | −0.027 (0.357) |
| Powertrain: FWD | | −152.850 (26.783) | −64.632 (28.848) |
| Powertrain: RWD | | −122.532 (28.525) | −42.439 (29.322) |
| Price (euros) | | | 0.002 (0.000) |
| N | 103 | 103 | 103 |
| R2 | 0.10 | 0.33 | 0.48 |
| Residual standard deviation | 120 | 104 | 92 |

Reference category for Powertrain: AWD

If we examine the models from Task 3d again, we can see that the model with only efficiency explains a relatively small share of the total variance in range: 10%. The fit improves substantially with the addition of powertrain (R-squared of 0.33) and improves further with the addition of price (0.48), so model (3) fits best by this measure. However, given that we have not fully checked the conditions for regression, some caution should be used in relying solely on R-squared.
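A minimal sketch of this comparison in R, using made-up rows with the assignment's column names. One reason for caution, beyond the regression conditions, is that R-squared can never decrease when a predictor is added to a nested model; adjusted R-squared, shown alongside, is an optional extra check that penalizes additional terms and is not required by the task.

```r
# Made-up rows; only the column names come from the assignment.
evs <- data.frame(
  Range_Km        = c(450, 270, 400, 170, 375, 330, 610, 95, 440, 250),
  Efficiency_WhKm = c(161, 167, 181, 168, 223, 180, 180, 176, 198, 164),
  PowerTrain      = factor(c("AWD", "RWD", "AWD", "FWD", "AWD",
                             "RWD", "AWD", "RWD", "FWD", "FWD")),
  PriceEuro       = c(55480, 30000, 40795, 32997, 180781,
                      65000, 105000, 24565, 50000, 29234)
)

# Three nested models, mirroring models (1) to (3) above.
m1 <- lm(Range_Km ~ Efficiency_WhKm, data = evs)
m2 <- update(m1, . ~ . + PowerTrain)
m3 <- update(m2, . ~ . + PriceEuro)

models <- list(m1, m2, m3)
fit <- data.frame(
  model  = c("(1)", "(2)", "(3)"),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared)
)
print(fit)
```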

Task 4b: Create a residuals histogram

Create a histogram of residuals plot for the best-fitting model. Discuss any patterns you observe in the residuals and what they indicate about the model’s fit.

As a reminder, the main regression condition that can be assessed with a histogram of the residuals is whether they are unimodal and symmetric. In this case, they are somewhat unimodal and symmetric but with a few outliers. I would classify this as a not-too-bad histogram of residuals, but some attention ought to be paid to the outliers. We can also see that the typical miss of our predictions (i.e. residual size) is roughly 90 km (the residual standard deviation of model (3) is 92 km), which I would classify as a relatively large miss.

Task 4c: Make a partial regression plot

Create partial regression plots for the model in Task 3d, looking at the relationships between Range_Km and its (quantitative) predictors, Efficiency_WhKm and PriceEuro. Interpret the plot.

Efficiency (wh per km)

Price (euros)

Partial residual plots of model (3)

In the first plot, we can see that after controlling for the other variables in the model, there is little relationship between efficiency and range. In the second plot, after controlling for the other variables, we can see a strong relationship between price and range. However, it is worth noting that some of the outliers appear to have high leverage and possibly high influence. Most of the data is at lower values of price, and the relationship overall appears to be logarithmic. We should consider using log(price) to improve the model.

5. Model assumptions (10 Points)

It’s important to ensure that the models you use are based on sound statistical assumptions. Your employer will want to know whether the conclusions drawn from the models are reliable.

Task 5a: Evaluate model assumptions

Evaluate whether the best-fitting model from Question 4 satisfies the regression assumptions outlined in Chapter 9.3. You can rely on the plots you’ve already made and create new ones. Provide a thorough explanation for each assumption and whether it holds in this model.

Linearity Assumption

We first want to check each quantitative predictor against the response variable (range) individually.

Efficiency vs Range

Price vs Range

Bivariate linearity check

We can see here that efficiency is reasonably linear against range, while price appears to be logarithmically related to range. Next we should examine the residual vs. predicted plot.

Residual vs predicted for model (3)

In this case, we can again see some evidence of linearity problems as evidenced by some of the curved pattern in the middle range of predicted values. Finally, we can re-examine the partial residual plots.

Efficiency (wh per km)

Price (euros)

Partial residual plots of model (3)

Again, the plot of price vs. range suggests that we should probably be using the values of log(price) in the regression. Overall, I would suggest that this regression fails the linearity condition.

Equal Variance Assumption

Residual vs predicted for model (3)

If we re-examine the residual plot, we can see that the variance in the residuals is relatively constant across the range of predicted values. There are a few outliers that do not follow this pattern but I think we can say this assumption has been satisfied.

Check the Residuals

Histogram of residuals of model (3)

As mentioned before, the histogram of the residuals appears to be somewhat unimodal and symmetric though with some real outliers. I would classify this condition as being mostly met.

6. Outliers and lurking variables (10 Points)

Outliers can skew results and lead to misleading conclusions, while lurking variables can create spurious relationships. Your employer will want to ensure that the analysis is robust to these issues.

Task 6a: Identify and remove outliers

Identify any outliers in the dataset that might be influencing your regression models. Remove these outliers and rerun the regression analysis. Discuss how the results change and identify which outliers were particularly influential.

Large residual outliers
| Name | Range (km) | Efficiency (wh per km) | Powertrain | Price (euro) | Residual |
|---|---|---|---|---|---|
| Tesla Cybertruck Tri Motor | 750 | 267 | AWD | 75000 | 342.3829 |
| Tesla Roadster | 970 | 206 | AWD | 215000 | 287.8708 |
| Porsche Taycan Turbo S | 375 | 223 | AWD | 180781 | -239.9765 |
| Smart EQ fortwo cabrio | 95 | 176 | RWD | 24565 | -174.3156 |
| Porsche Taycan Cross Turismo | 385 | 217 | AWD | 150000 | -170.1418 |
| Smart EQ forfour | 95 | 176 | RWD | 22030 | -169.3746 |
| Smart EQ fortwo coupe | 100 | 167 | RWD | 21387 | -163.3627 |
| Porsche Taycan Turbo | 390 | 215 | AWD | 148301 | -161.8839 |
| Nissan Ariya 87kWh | 440 | 198 | FWD | 50000 | 143.8920 |
| Lucid Air | 610 | 180 | AWD | 105000 | 141.5758 |

There are several ways to identify outliers. The first is to check which observations have the largest residuals. From this table, we can see the largest residual belongs to the Tesla Cybertruck. After doing some research online, it seems that the Cybertruck can only achieve this range with an optional range extender that costs an extra ~$15000 and uses up cargo space in the vehicle. It seems reasonable to exclude the Cybertruck as a mistake in this case. The Tesla Roadster will have the range indicated and is not a mistake, but the car is a niche sports car. If your company does not sell these types of cars, it might be reasonable to exclude the Roadster. One could make a similar argument for the Porsche models, though these cars are starting to enter the realm of cars that a normal person can buy, so it may be better to include them.
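A table like the one above can be produced by attaching the residuals to the data and sorting by absolute size. This is a minimal sketch with made-up rows and hypothetical car names ("Car A" and so on); only the column names, including full.name, come from the assignment, and the model is simplified to two predictors to keep the example short.

```r
# Made-up rows with hypothetical car names.
evs <- data.frame(
  full.name       = paste("Car", LETTERS[1:8]),
  Range_Km        = c(450, 270, 400, 170, 375, 330, 610, 95),
  Efficiency_WhKm = c(161, 167, 181, 168, 223, 180, 180, 176),
  PriceEuro       = c(55480, 30000, 40795, 32997, 180781, 65000, 105000, 24565)
)

model <- lm(Range_Km ~ Efficiency_WhKm + PriceEuro, data = evs)

# Attach residuals (safe here because no rows have missing values),
# then sort so the largest misses in either direction come first.
evs$Residual <- resid(model)
worst <- evs[order(abs(evs$Residual), decreasing = TRUE), ]

head(worst, 3)
```

Sorting on the absolute value matters: large negative residuals (cars with far less range than predicted) are just as noteworthy as large positive ones.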

Efficiency (wh per km)

Price (euros)

Partial residual plots with outliers noted

Unfortunately there are no easy ways to visually label outliers in a partial residual plot, but we can again see that, based on the x value of the plotted points, the Cybertruck and Roadster are notable outliers in the efficiency and price partial residual plots as well.

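library(dplyr)  # provides %>% and filter()
# The trailing space in "Tesla Roadster " is kept as written; it presumably matches the stored name.
evs.nooutliers <- evs %>% 
  filter(full.name != "Tesla Cybertruck Tri Motor" & full.name != "Tesla Roadster ")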
Modeling range with outliers removed
Models (1) to (3): full dataset. Models (4) to (6): outliers removed.

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| (Intercept) | 86.376 (77.106) | 390.864 (84.547) | 268.596 (78.158) | 156.999 (65.612) | 432.249 (68.255) | 363.005 (68.215) |
| Efficiency (wh per km) | 1.334 (0.403) | 0.172 (0.401) | −0.027 (0.357) | 0.911 (0.345) | −0.137 (0.326) | −0.276 (0.313) |
| Powertrain: FWD | | −152.850 (26.783) | −64.632 (28.848) | | −140.037 (21.275) | −89.232 (25.359) |
| Powertrain: RWD | | −122.532 (28.525) | −42.439 (29.322) | | −108.215 (22.645) | −62.389 (25.576) |
| Price (euros) | | | 0.002 (0.000) | | | 0.001 (0.000) |
| N | 103 | 103 | 103 | 101 | 101 | 101 |
| R2 | 0.10 | 0.33 | 0.48 | 0.07 | 0.37 | 0.43 |
| Residual standard deviation | 120 | 104 | 92 | 99 | 83 | 79 |

Reference category for Powertrain: AWD

Overall, removing the outliers did produce some changes in the models. Comparing the full models (3) and (6), it increased the importance of powertrain (the FWD and RWD coefficients grew in magnitude) and decreased the importance of price. It also increased the magnitude of the efficiency coefficient, whose sign is in the direction we initially expected: more efficient cars have a longer predicted range. Note that the R-squared of the full model actually decreased with the removal of the outliers. It seems possible that the regression line was being “pulled” toward the outliers, which created the impression of greater model quality than actually existed.

Task 6b: Consider lurking variables

Consider any variables not included in the dataset that you think might be missing from the model. Explain your reasoning for why these variables could be important and how they might affect the model.

Any reasonable set of variables is ok here as long as you provide a quality justification for why they might improve the prediction of range.

7. Prediction (10 Points)

Your employer may want to predict the performance of their vehicles under different scenarios. Being able to make accurate predictions is key to making informed strategic decisions.

Task 7a: Predict the range of a specific car

Using the model from Task 3d, predict the range of an electric car with the following characteristics:

  • Efficiency_WhKm: 200 Wh/km
  • PowerTrain: FWD
  • PriceEuro: 50,000 Euros
# Intercept
intercept <- rg.eff.power.price.model$coefficients[1]
# Efficiency coefficient
eff.coef <- rg.eff.power.price.model$coefficients[2]
# Powertrain FWD coefficient
fwd.coef <- rg.eff.power.price.model$coefficients[3]
# Price coefficient
price.coef <- rg.eff.power.price.model$coefficients[5]

pred.y <- intercept + eff.coef*200 + fwd.coef*1 + price.coef*50000 

unname(round(pred.y, digits=2))
[1] 296.05

Other ways of calculating this are ok too as long as work is shown.
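One such alternative is R's predict() function, which avoids picking coefficients out by position. The sketch below uses made-up rows, so its output will not match the 296.05 above; it only shows that predict() and the manual calculation agree on the same fitted model.

```r
# Made-up rows; only the column names come from the assignment.
evs <- data.frame(
  Range_Km        = c(450, 270, 400, 170, 375, 330, 610, 95, 440, 250),
  Efficiency_WhKm = c(161, 167, 181, 168, 223, 180, 180, 176, 198, 164),
  PowerTrain      = factor(c("AWD", "RWD", "AWD", "FWD", "AWD",
                             "RWD", "AWD", "RWD", "FWD", "FWD")),
  PriceEuro       = c(55480, 30000, 40795, 32997, 180781,
                      65000, 105000, 24565, 50000, 29234)
)

model <- lm(Range_Km ~ Efficiency_WhKm + PowerTrain + PriceEuro, data = evs)

# The car described in the task, as a one-row data frame.
new_car <- data.frame(
  Efficiency_WhKm = 200,
  PowerTrain      = factor("FWD", levels = levels(evs$PowerTrain)),
  PriceEuro       = 50000
)
pred <- unname(predict(model, newdata = new_car))

# Same calculation by hand. Coefficient order here is:
# intercept, efficiency, FWD dummy, RWD dummy, price.
manual <- sum(coef(model) * c(1, 200, 1, 0, 50000))

round(c(predict = pred, manual = manual), 2)
```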

Task 7b: Visualize predicted values

Create a plot of the predicted range values for an electric car with an Efficiency_WhKm of 200 Wh/km and a PowerTrain of FWD, while varying the price across its range. Interpret the plot.

Visualized predicted values of range

There are a number of R packages that can help you create this plot, but you can easily create it yourself by adding to the intercept the efficiency coefficient × 200 and the FWD coefficient × 1, then using that result as the new intercept. Overall, we can see that moving through the range of price produces a relatively large change in predicted range, meaning that the variable is substantively significant in the model.
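Written out with the rounded coefficients of model (3) as printed in the table, the "new intercept" approach gives the following line. Because the price coefficient is rounded to 0.002, the slope is only approximate.

$$
\widehat{\text{Range}} = (268.596 - 0.027 \times 200 - 64.632) + 0.002 \times \text{PriceEuro} = 198.564 + 0.002 \times \text{PriceEuro}
$$

At a price of 50,000 euros this line gives about 298.6 km, close to but not the same as the 296.05 km computed in Task 7a from the unrounded coefficients.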

8. Re-expression (5 Points)

Sometimes, transforming variables can improve model fit and make relationships clearer. Your employer wants you to try a log transformation on the price variable to see if it improves the model.

Task 8a: Log-transform the price variable

Re-express the model from Task 3d by log-transforming the PriceEuro variable. Compare the new model to the original model in terms of both the coefficients and model fit. Explain the logic behind this particular transformation and discuss which model you prefer and why.

Modeling range with log price
Range model

| | (1) | (2) |
|---|---|---|
| (Intercept) | 268.596 (78.158) | −1579.289 (312.221) |
| Efficiency (wh per km) | −0.027 (0.357) | −0.377 (0.348) |
| Powertrain: FWD | −64.632 (28.848) | −20.415 (30.412) |
| Powertrain: RWD | −42.439 (29.322) | −7.364 (29.854) |
| Price (euros) | 0.002 (0.000) | |
| Log price (euros) | | 185.117 (28.566) |
| N | 103 | 103 |
| R2 | 0.48 | 0.53 |
| Residual standard deviation | 92 | 88 |

Reference category for Powertrain: AWD

The log price coefficient also appears to be meaningful for predicting range. Every 1% increase in price increases predicted range by about 1.8 km (185.117 × ln(1.01) ≈ 1.84). From this table, we can also note that the standard deviation of the residuals (a measure of the typical “miss” or residual size) decreased slightly and the R-squared increased slightly. Log-transforming price decreased the importance of the powertrain coefficients but increased the size of the efficiency coefficient.
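The interpretation of a log-transformed predictor can be worked through directly. This assumes the natural log was used (the default for R's log() function), which is consistent with the size of the intercept and coefficient in the table.

$$
\begin{aligned}
\text{1\% price increase:} \quad \Delta \widehat{\text{Range}} &= 185.117 \times \ln(1.01) \approx 1.84 \text{ km} \\
\text{Doubling of price:} \quad \Delta \widehat{\text{Range}} &= 185.117 \times \ln(2) \approx 128.3 \text{ km}
\end{aligned}
$$

The key point is that the effect depends on the ratio of prices rather than the difference: going from 20,000 to 40,000 euros adds the same predicted range as going from 100,000 to 200,000 euros.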

Residual plots of model with price compared to model with log(price)

We can now see that the residual non-linearity identified in Questions 4 and 5 has largely been resolved by taking the log of price.

Histogram of residuals of model with price compared to model with log(price)

On the other hand, taking the log of price only marginally improves the symmetry of the residuals histogram.

Partial residual plots of model with price compared to model with log(price)

The partial residual plot most clearly indicates how the model fit has improved; the relationship with price is much more obviously linear as a result of taking the log of price.

Overall, it’s clear that the model with log price is superior on most dimensions, though it is harder to interpret the coefficient without experience in thinking in log terms.

9. Independent analysis (15 Points)

Task 9a: Research and add Chinese EVs to the dataset

The dataset is largely missing EVs from the top Chinese carmakers. Do some research and select five Chinese EVs of your choice. Manually add their Range_Km, Efficiency_WhKm, PowerTrain, and PriceEuro to the dataset. You can leave the other variables as NA.

Task 9b: Analyze the Chinese EVs

Compare the performance of the Chinese EVs you added to the existing cars in the dataset. Repeat some of the analyses from the previous questions (your choice) and discuss whether the results change after including the Chinese EVs.

Answers will vary but should be in line with the level of analysis of the previous questions.